Prediction with SVMs over SLM.

In this notebook I confirm that the good performance that I get with the letters does not depend on some quirk involved with the fact that the letters (that is the targers) are represented as text and not as number

Libraries and files


In [1]:
import numpy as np
import h5py
from sklearn import svm, cross_validation, preprocessing

In [2]:
# First we load the file 
file_location = '../results_database/text_wall_street_big.hdf5'
run_name = '/low-resolution'
f = h5py.File(file_location, 'r')

# Now we need to get the letters and align them
text_directory = '../data/wall_street_letters.npy'
letters_sequence = np.load(text_directory)
Nletters = len(letters_sequence)
symbols = set(letters_sequence)

Now we transform the letters into number with a dictionary


In [3]:
symbol_to_number = {}

for number, symbol in enumerate(symbols):
    symbol_to_number[symbol] = number

letters_sequence = [symbol_to_number[letter] for letter in letters_sequence]

Load nexa with its parameters


In [4]:
# Nexa parameters
Nspatial_clusters = 5
Ntime_clusters = 15
Nembedding = 3

parameters_string = '/' + str(Nspatial_clusters)
parameters_string += '-' + str(Ntime_clusters)
parameters_string += '-' + str(Nembedding)

nexa = f[run_name + parameters_string]

Now we make the predictions


In [5]:
delay = 4
N = 5000
cache_size = 1000

In [6]:
# Exctrat and normalized SLM
SLM = np.array(f[run_name]['SLM'])

print('Standarized')
X = SLM[:,:(N - delay)].T
y = letters_sequence[delay:N]
# We now scale X
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.10)

clf_linear = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_linear.fit(X_train, y_train)
score = clf_linear.score(X_test, y_test) * 100.0
print('Score in linear', score)

clf_rbf = svm.SVC(C=1.0, cache_size=cache_size, kernel='rbf')
clf_rbf.fit(X_train, y_train)
score = clf_rbf.score(X_test, y_test) * 100.0
print('Score in rbf', score)

print('Not standarized')
X = SLM[:,:(N - delay)].T
y = letters_sequence[delay:N]

# We now scale X
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.10)

clf_linear = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_linear.fit(X_train, y_train)
score = clf_linear.score(X_test, y_test) * 100.0
print('Score in linear', score)

clf_rbf = svm.SVC(C=1.0, cache_size=cache_size, kernel='linear')
clf_rbf.fit(X_train, y_train)
score = clf_rbf.score(X_test, y_test) * 100.0
print('Score in rbf', score)


Standarized
Score in linear 98.8
Score in rbf 97.6
Not standarized
Score in linear 99.6
Score in rbf 99.6